Algorithm 10 Search process of DCP-NAS

Input: Training data, validation data
Parameter: Searching hyper-graph $G$, $M = 8$, $e(o^{(i,j)}_m) = 0$ for all edges
Output: Optimized $\hat{\alpha}$

1:  while DCP-NAS do
2:      while Training real-valued Parent do
3:          Search a temporary real-valued architecture $p(w, \alpha)$.
4:          Decoupled optimization from Eqs. 4.43 to 4.53.
5:          Generate the tangent direction $\partial \tilde{f}(w) / \partial \alpha$ from Eqs. 4.21 to 4.29.
6:      end while
7:      Architecture inheriting $\hat{\alpha} \leftarrow \alpha$.
8:      while Training 1-bit Child do
9:          Calculate the learning objective from Eqs. 4.26 to 4.32.
10:         Tangent propagation from Eqs. 4.33 to 4.41 and decoupled optimization from Eqs. 4.43 to 4.53.
11:         Obtain $\hat{p}(\hat{w}, \hat{\alpha})$.
12:     end while
13:     Architecture inheriting $\alpha \leftarrow \hat{\alpha}$.
14: end while
15: return Optimized architecture $\hat{\alpha}$.
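To make the control flow of Algorithm 10 concrete, the following is a minimal Python sketch of the alternating Parent/Child loop. It is a sketch only: `dcp_nas_search`, `search_step`, `decoupled_step`, `tangent_direction`, `learning_objective`, and `tangent_propagation` are hypothetical placeholder names that mirror the steps of the algorithm, not functions from a released DCP-NAS codebase.

```python
# Hypothetical sketch of Algorithm 10's control flow; not an official DCP-NAS implementation.
# `parent` and `child` stand for the real-valued Parent p(w, alpha) and the 1-bit Child
# p_hat(w_hat, alpha_hat); all method names below are placeholders.

def dcp_nas_search(parent, child, train_data, val_data, rounds, parent_steps, child_steps):
    for _ in range(rounds):                                    # outer "while DCP-NAS do" loop
        # Train the real-valued Parent.
        for _ in range(parent_steps):
            parent.search_step(train_data)                     # temporary real-valued architecture p(w, alpha)
            parent.decoupled_step(val_data)                    # decoupled optimization (Eqs. 4.43-4.53)
            tangent = parent.tangent_direction()               # tangent direction d f~(w)/d alpha (Eqs. 4.21-4.29)

        child.alpha = parent.alpha                             # architecture inheriting: alpha_hat <- alpha

        # Train the 1-bit Child.
        for _ in range(child_steps):
            loss = child.learning_objective(train_data, tangent)  # learning objective (Eqs. 4.26-4.32)
            child.tangent_propagation(loss)                    # tangent propagation (Eqs. 4.33-4.41)
            child.decoupled_step(val_data)                     # decoupled optimization (Eqs. 4.43-4.53)

        parent.alpha = child.alpha                             # architecture inheriting: alpha <- alpha_hat

    return child.alpha                                         # optimized architecture alpha_hat
```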

where $\odot$ represents the Hadamard product and $\eta = \eta_1 \eta_2$. We take $\psi_t = \big[\sum_{m=1}^{M}\sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}}, \cdots, \sum_{m=1}^{M}\sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}}\big]^T$. Note that $\frac{\partial w}{\partial \alpha}$ has no explicit form in NAS and cannot be solved directly, which makes $\psi_t$ unsolvable as well. Thus we introduce a learnable parameter $\tilde{\psi}_t$ to approximate $\psi_t$, whose back-propagation process is calculated as

$$\tilde{\psi}_{t+1} = \left|\, \tilde{\psi}_t - \eta_\psi \frac{\partial L}{\partial \tilde{\psi}_t} \,\right|. \tag{4.51}$$
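To illustrate, the update in Eq. 4.51 is ordinary gradient descent on $\tilde{\psi}$ followed by an element-wise absolute value. Below is a minimal PyTorch sketch, assuming $\tilde{\psi}_t$ is stored in a tensor `psi_tilde` and using a toy loss in place of the actual DCP-NAS objective; all names are illustrative.

```python
import torch

# Illustrative sketch of Eq. 4.51: psi_tilde_{t+1} = | psi_tilde_t - eta_psi * dL/d(psi_tilde_t) |.
# `psi_tilde`, `eta_psi`, and the toy loss are placeholders, not from an official codebase.

psi_tilde = torch.randn(8, requires_grad=True)   # learnable approximation of psi_t (size chosen arbitrarily)
eta_psi = 1e-3                                   # learning rate for the psi update

loss = psi_tilde.pow(2).sum()                    # stand-in for the DCP-NAS learning objective L
loss.backward()                                  # compute dL / d(psi_tilde)

with torch.no_grad():
    psi_tilde.copy_(torch.abs(psi_tilde - eta_psi * psi_tilde.grad))  # Eq. 4.51
    psi_tilde.grad.zero_()                       # clear the gradient for the next step
```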

Eq. 4.50 shows that our method is based on a projection function that solves the optimization coupling problem through the learnable parameter $\tilde{\psi}_t$. In this method, we consider the influence of $\alpha^{t}$ and backtrack the optimized state at the $(t+1)$-th step to form $\tilde{\alpha}^{t+1}$. However, the key point in optimization is where and when the backtracking should be applied. Thus, we define the update rule as

$$\tilde{\alpha}^{t+1}_{:,m} =
\begin{cases}
P(\alpha^{t+1}_{:,m}, \alpha^{t}_{:,m}), & \text{if } \mathrm{ranking}(R(w_m)) > \tau \\
\alpha^{t+1}_{:,m}, & \text{otherwise}
\end{cases} \tag{4.52}$$

where $P(\alpha^{t+1}_{:,m}, \alpha^{t}_{:,m}) = \alpha^{t+1}_{:,m} + \eta\, \tilde{\psi}_t \odot \alpha^{t}_{:,m}$ and the subscript $\cdot_{m}$ denotes a specific edge. $R(w_m)$ denotes the norm constraint of $w_m$ and is further defined as

$$R(w_m) = \|w_m\|_2^2, \qquad m = 1, \cdots, M, \tag{4.53}$$
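Putting Eqs. 4.52 and 4.53 together: edges are ranked by the squared $\ell_2$ norm of their weights, and only the edges whose rank exceeds $\tau$ are backtracked through the projection $P$. The following is a minimal PyTorch-style sketch of this rule under stated assumptions: the function and variable names are illustrative, $\tilde{\psi}_t$ is assumed to match the length of each $\alpha_{:,m}$, the ranking direction (larger norm gets higher rank) is an assumption of the sketch, and the threshold `tau` is passed in as a parameter (its value is defined right after this sketch in Eq. 4.54).

```python
import torch

def backtrack_alpha(alpha_new, alpha_old, weights, psi_tilde, eta, tau):
    """Hypothetical sketch of the backtracking rule in Eqs. 4.52-4.53 (not an official implementation).

    alpha_new : (K, M) tensor holding alpha^{t+1} (K candidate operations, M edges)
    alpha_old : (K, M) tensor holding alpha^{t}
    weights   : list of M weight tensors, one per edge
    psi_tilde : (K,) tensor, the learnable approximation of psi_t
    eta       : scalar, eta = eta_1 * eta_2
    tau       : rank threshold (e.g. epsilon * M, see Eq. 4.54)
    """
    # R(w_m) = ||w_m||_2^2 for every edge m (Eq. 4.53).
    norms = torch.stack([w.pow(2).sum() for w in weights])

    # ranking(R(w_m)): rank 1 for the smallest norm, rank M for the largest
    # (the ranking direction is an assumption of this sketch).
    ranking = torch.empty_like(norms)
    ranking[norms.argsort()] = torch.arange(1, len(weights) + 1, dtype=norms.dtype)

    alpha_tilde = alpha_new.clone()
    for m in range(len(weights)):
        if ranking[m] > tau:
            # P(alpha^{t+1}_{:,m}, alpha^t_{:,m}) = alpha^{t+1}_{:,m} + eta * psi_tilde (Hadamard) alpha^t_{:,m}
            alpha_tilde[:, m] = alpha_new[:, m] + eta * psi_tilde * alpha_old[:, m]
    return alpha_tilde
```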

In Eq. 4.52, $\tau$ denotes the threshold for deciding whether or not to backtrack. We further define the threshold as

$$\tau = \epsilon \cdot M, \tag{4.54}$$

where $\epsilon$ denotes a hyperparameter that controls the percentage of edges to backtrack. By backtracking $\alpha$, the supernet can learn to jump out of local minima. The general process of DCP-NAS is described in Algorithm 10. Note that the decoupled optimization can be